Skip to content

Conversation

@gabe-l-hart
Copy link
Collaborator

@gabe-l-hart gabe-l-hart commented May 29, 2025

Dependencies

Description

This PR replaces my initial (more generic) attempt in #13276. Following Georgi's comment, this version is explicit about owning two child caches, one unified and one recurrent. This greatly simplifies the logic and avoids the need for additional interface changes to support managing child caches in the abstract.

This PR has been verified to be minimally functional in #13550 (Granite 4 / Bamba).

@ggerganov ggerganov force-pushed the gg/kv-cache-simplify-part3 branch 2 times, most recently from 256f1b7 to 9d05381 Compare May 30, 2025 08:22
@ggerganov ggerganov force-pushed the gg/kv-cache-simplify-part3 branch from 9d05381 to 2b984f4 Compare May 30, 2025 08:29
@ggerganov
Copy link
Member

Sorry about the massive changes, hopefully we are almost converging to the final version of the implementation. See #13746 (comment) for the latest short-term plan.

@gabe-l-hart
Copy link
Collaborator Author

No problem, and thanks for the quick work getting these pieces overhauled! I'll plan to keep this branch up to date as the changes roll in so that it will hopefully be ready for review quickly once its turn comes.

@gabe-l-hart gabe-l-hart mentioned this pull request May 30, 2025
3 tasks
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in
llama-arch with llama_model_is_recurrent delegating to
llm_arch_is_recurrent. The same split is done for hybird. This is needed
because there are places where the llama_model has not yet been initialized
but we need to check if the model is recurrent (specifically for the
per-layer recurrent check array in hparams).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
…s in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
…l is recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
The implementation of the hybrid cache intentionally does not specify the
types of the child caches, so there was a naming mismatch with these
predicate functions that used "hybrid" to imply "hybrid recurrent."

Branch: HybridCache

Signed-off-by: Gabe Goodhart <[email protected]>
This will be needed by other cache types as well, so centralizing the
definition will make it more reusable.

Branch: HybridCache

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: HybridCache

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: HybridCache

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
This follows the pattern in iswa where the two child caches are held
explicitly to support the case where a model requires a single attention
cache and a single recurrent cache where each layer uses exactly one of the
caches.

This is a rewrite of the more generic approach in the original hybrid cache
PR: ggml-org#13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
This includes a refactor of the create_memory logic to avoid needing to use
the arch enum explicitly unless a model needs explicit cache instantiation
logic beyond the standard logic for recurrent, hybrid, unified, and iswa.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart gabe-l-hart force-pushed the HybridRecurrentCache branch from 089c968 to 4a2709f Compare May 30, 2025 15:38
@ggerganov ggerganov force-pushed the gg/kv-cache-simplify-part3 branch from f23e4cc to 71619f2 Compare May 31, 2025 07:05
@ggerganov ggerganov deleted the branch ggml-org:gg/kv-cache-simplify-part3 May 31, 2025 07:24
@ggerganov ggerganov closed this May 31, 2025
@gabe-l-hart
Copy link
Collaborator Author

Since this was auto-closed when #13746 merged, I've rebased and reopened as #13979

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants